This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how certain aspects of education have evolved over time and how changes in one variable may influence changes in another. We will analyze each variable across different categories and explore which trends emerge from doing so. By focusing on longitudinal, single-nation data, we aim to identify potential patterns and meaningful interactions that could offer insight into how the attendance and enrollment of students has changed before and after COVID-19 pandemic in the United States public school system. From there, we will draw hypotheses about why these trends may have arisen.
The analysis of various indicators allows us to examine temporal
changes and potential external factors that might shape educational
outcomes.
We focus on two major indicator categories that reflect
different dimensions of the education system. These indicators will be
explored individually and together.
Enrollment Rate
We will compare student enrollment rates across different levels of
schooling (elementary, middle, and high school) and different
years.
Attendance Rate
We will also compare student attendance rates across different levels of
schooling (elementary, middle, and high school) and different
years.
Education is a topic that all of us are connected to as students who grew up in the public school system, and one that our classmates and readers can easily relate to. While a focus on Emory-specific data would offer localized information, broadening our scope to the World Bank’s Education Statistics offers larger context and long-term coverage.
We focus on data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment and attendance can show how nation-wide policy changes and major events shape these measures. This matters to us as students who moved through all three levels of schooling and saw hour own efforts, most notably in high school, would determine our outcomes later in life.
Attendance and enrollment jump out to us as variables of interest due to their potential interaction with one another. Because we experienced schooling during COVID, we examine whether national trends reflect similar shifts. We test whether similar shifts appear at the national level.
Prior literature shows declines in attendance during the pandemic with a minimal recovery in the following years years, alongside temporary decreases in enrollment. Our analysis extends that work by looking into whether these patterns persist at a national level.
Our analysis draws on the World Bank’s Education Statistics (EdStats) dataset, a comprehensive collection of global indicators on education access, participation, efficiency, outcomes, and expenditure. We specifically use the United States subset, covering indicators related to enrollment and attendance. The data span multiple decades, allowing us an examination of both long-term structural trends and shorter-term policy effects.
We narrow our analysis to only sex-combined variables, as each of them were present with male-only, female-only, and male and female data. Given the scope of our project, it makes more sense to explore the sex-combined variables in more depth rather than three less detailed explorations of all three levels.
We will perform Exploratory Data Analysis (EDA) in R
using the tidyverse suite of libraries along with some
supplementary packages for smoother visualization.
Our analysis will include the following components:
The “r setup” code chunk applies to every code chunk in the file, so this will allow us to create a “global environment” to use in our analysis. First, we install and load all necessary packages, then set global options for reproducibility and consistent output formatting. We also create functions that will streamline table formatting for data any outputs of interest.
# install.packages("kableExtra")
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("plotly")
# install.packages("rmarkdown")
# install.packages("broom")
# install.packages("ggpubr")
# install.packages("janitor")
library(kableExtra)
library(tidyverse)
library(readr)
library(knitr)
library(plotly)
library(janitor)
library(lubridate)
library(rmarkdown)
library(broom)
library(ggpubr)
#functions to format dts
pretty_r2_table <- function(df, caption_text){
df %>%
kable("html", caption = caption_text) %>%
kable_styling(full_width = F) %>%
row_spec(0, bold = TRUE) %>%
column_spec(3, color = ifelse(df$p.value < 0.05, "red", "black"))
}
pretty_cor_table <- function(x, y, var1_name = NULL, var2_name = NULL) {
cor_res <- cor.test(x, y)
tibble(
Variable1 = var1_name,
Variable2 = var2_name,
Correlation = round(cor_res$estimate, 3),
p.value = signif(cor_res$p.value, 3)) %>%
kable("html", caption = paste0("Correlation: ", var1_name, " vs ", var2_name)) %>%
kable_styling(full_width = F) %>%
row_spec(0, bold = TRUE) %>%
column_spec(4, color = ifelse(cor_res$p.value < 0.05, "red", "black"))
}
set.seed(123)
opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
This dataset is very large and contains data from various countries.
As previously mentioned, we are focusing only on the US data. We start
from reading the raw EdStats dataset (EdStats_v01.csv) and
filter for only the United States data
(Country code == "USA"). We also drop fully empty columns
(2024 column) as they will not be needed in our analysis.
As a check, we print any columns that contain all NA values and proceed
only if the the output is empty. Then, we write the filtered dataset to
EdStats_USA.csv so that later chunks can work only with our
US data without filtering for it multiple times.
This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.
#set up data for use
#read data, skip empty 2024 column
data <- read_csv("data/EdStats_v01.csv", show_col_types=FALSE, col_types=cols(`2024`=col_skip()))
# filter for only USA data
data_clean <- data %>%
filter(grepl("USA", `Country code`))
#make sure no NA cols remain
if (length(colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]) == 0) {
message("No NA cols remain.")
} else {
message("NA columns remain.")
}
## No NA cols remain.
#save filtered dataset
write_csv(data_clean, "data/EdStats_USA.csv")
To keep downstream steps tidy and reproducible, we create a thematic subset aligned to our research questions:
EdStats_attend.csv.This ensures we work with clean, directly comparable variables.
# Enrollment vs. Attendance
# filter for enrollment/attendance info, select only sex-combined data
data <- read_csv("data/EdStats_USA.csv", show_col_types = FALSE)
data_filter_attend <- data %>%
filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
filter(!grepl("female|male", `Indicator name`)) %>%
filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
select(where(~!all(is.na(.)))) #%>% view() %>%
# write_csv("data/EdStats_attend.csv")
The following chunk expands our analysis of
EdStats_attend.csv to quantify and visualize the
relationship between school enrollment and
attendance across education levels in the United
States.
After reshaping the dataset into tidy long form, we standardize
indicator names to capture both spellings of “enrolment/enrollment” and
collapse duplicate series to obtain a single observation for each
year × level × type combination. The long format allows us to
have columns for country name and country code (only really needed for
reproducibility), variable name, variable value, year, type (simplified
variable name: attendance/enrollment), and level (renamed: primary
school to elementary school, lower secondary to middle school, and upper
secondary to high school). Previously, the “indicator column” contained
school type, school level, and sex all in one string. With this new
format, we can plot variables in ggplot with column names.
tendency_attend.csv for
reproducibility.In interpreting the following plots, we focus on whether attendance consistently trails enrollment and whether that gap changes across levels or decades. High correlation would suggest structural progress, while a persistent gap would indicate that access alone does not guarantee participation.
data_attend <- read_csv("data/EdStats_attend.csv", show_col_types = FALSE)
#long format
data_long_attend <- data_attend %>%
pivot_longer(cols=`1975`:`2022`, names_to="Year", values_to="Value") %>%
mutate(
Year = as.numeric(Year),
Type = case_when(
grepl("attendance", `Indicator name`, ignore.case=TRUE) ~ "Attendance",
grepl("enrolment", `Indicator name`, ignore.case=TRUE) ~ "Enrollment",
TRUE ~ "Other"
),
Level = case_when(
grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
TRUE ~ "Other"
)
) %>%
filter(Type != "Other", !is.na(Value)) %>%
mutate(
Level = factor(Level, levels = c("Elementary School", "Middle School", "High School")),
Type = factor(Type, levels = c("Attendance", "Enrollment"))
)
# write.csv(data_long_attend, "data_long_attend.csv")
#scatter plot to show distribution of pre-filtered data
ggplot(data_long_attend, aes(x=factor(Year), y=Value, color=Type)) +
geom_jitter(width=0.1, alpha=0.7, size=3) +
theme_minimal() +
labs(
title="Distribution of Attendance vs Enrollment Rates by Year (Pre-filtering)",
x="Year",
y="Rate (%)",
color="Type"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#boxplot to show distribution of pre-filtered data
ggplot(data_long_attend, aes(x=factor(Year), y=Value, fill=Type)) +
geom_boxplot(alpha=0.5, outlier.shape = 16, outlier.color="red") +
theme_minimal() +
labs(title="Distribution of Attendance vs Enrollment Rates by Year (Pre-filtering)", x="Year", y="Rate (%)") +
theme(axis.text.x = element_text(angle=45, hjust=1))
The above plots show the distribution of the data before we focus on years with data for both variables. As can be seen in the scatter plot and box plot, the earlier years contain only enrollment data, while attendance data is added around 2005. Ggplot uses interquartile range (IQR) to calculate outliers, which works for our dual-variable analysis, so we do not need to use any multivariable outlier calculation techniques. Any outliers would be marked with red dots in the box plot, and seeing as there are none, we do not need to remove any outliers.
We have data in long format, and next we make two more dataframes: one in wider format and one in a combined format. The wider dataframe is only wider in comparison to the long df, not the raw data. It separates the “type” column (attendance and enrollment rates) from the long data frame into two columns, one for enrollment rate and one for attendance rate. However, this creates NA values the rates are not lined up by year and level. Thus, we create a final data combined dataframe that addresses this issue and lines up year, level, attendance, and enrollment. The data combined contains only the values have data for both attendance and enrollment in the same year and level, which leaves us with substantially less data than we started with. However, this is necessary for accurate plotting and calculations.
Please note that all variables and dataframes are names x_attend but contain both attendance and enrollment data.
data_attend <- read_csv("data/data_long_attend.csv", show_col_types = FALSE)
#select years with most observations
data_attend_years <- data_attend %>%
filter(!Year %in% c(1975, 1986, 1987, 1990, 1991, 1993, 1994, 1995, 1996, 1999))
write_csv(data_attend_years, "data/data_long_attend.csv")
#wide format
data_wide_attend <- data_attend_years %>%
pivot_wider(
names_from=Type,
values_from=Value
) #%>% view()
#combined format
data_combined_attend <- data_wide_attend %>%
group_by(Level, Year) %>%
summarise(
Attendance = Attendance[!is.na(Attendance)],
Enrollment = Enrollment[!is.na(Enrollment)]
) %>%
ungroup() %>%
mutate(Level = factor(Level, levels = c("Elementary School", "Middle School", "High School"))) #%>% view()
# write_csv(data_combined_attend, "data/data_combined_attend.csv")
The scatter and box plots below show the distribution of the data after filtering out the years without enough attendance data. The number of observations in either variable is now closer to being equal, which is ideal for our next analysis steps.
data_long <- read_csv("data/data_long_attend.csv", show_col_types = FALSE) %>%
mutate(
Level = factor(Level, levels = c("Elementary School", "Middle School", "High School")),
Type = factor(Type, levels = c("Attendance", "Enrollment"))
)
#scatter plot to show distribution of post-filtered data
ggplot(data_long, aes(x=factor(Year), y=Value, color=Type)) +
geom_jitter(width=0.1, alpha=0.7, size=3) +
theme_minimal() +
labs(
title="Distribution of Attendance vs Enrollment Rates by Year (Post-filtering)",
x="Year",
y="Rate (%)",
color="Type"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#boxplot to show distribution of post-filtered data
ggplot(data_long, aes(x=factor(Year), y=Value, fill=Type)) +
geom_boxplot(alpha=0.5, outlier.shape = 16, outlier.color="red") +
theme_minimal() +
labs(title="Distribution of Attendance vs Enrollment Rates by Year (Post-filtering)", x="Year", y="Rate (%)") +
theme(axis.text.x = element_text(angle=45, hjust=1))
We now look at the trends in the data. Using our long and combined datasets, we can build visuals to display how attendance and enrollment rates change as time progresses. We re-factor the data after loading it in so that the variables are ordered correctly in the plots, and then build the plots themselves.
The following three plots presents annual attendance and enrollment rates for US elementary, middle, and high schools. Each color represents a different variable as can be seen in the legend. Each panel shows how these two metrics evolve side by side within a given school level. Enrollment rates include more fluctuation than attendance rates for all three levels of schools, especially in high schools. Please note that attendance data is available one year later than enrollment data.
Readers can hover over the plots to see specific data point information, zoom in or out, or double click a variable on the legend to see only the data for the variable.
data_long <- read_csv("data/data_long_attend.csv", show_col_types = FALSE) %>%
mutate(
Level = factor(Level, levels = c("Elementary School", "Middle School", "High School")),
Type = factor(Type, levels = c("Attendance", "Enrollment"))
)
p_line_per_school_horizontal <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
geom_line(size=1.2, na.rm=TRUE) +
geom_point(size=2, alpha=0.8) +
facet_grid(~Level, scales="fixed", switch="y") +
theme_minimal() +
theme(
strip.background=element_rect(fill="grey90", color=NA),
panel.spacing=unit(1, "lines")
) +
labs(
title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
y="Rate (%)",
color="Variable"
) +
scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
)
# ggsave("line_per_school_horizontal.png", p_line_per_school_horizontal)
p_line_interactive1 <- ggplotly(p_line_per_school_horizontal)
p_line_interactive1
Next, we visualized how enrollment trends changed over time across the three different school levels. Each color represents the variable shown in the legend, each line represents the linear trend, and shaded regions show 95% confidence intervals. We see a steady increase in high schools, which may suggest possible demographic shifts or improved retention in upper grades. In contrast, enrollment in elementary schools slightly decreased over time, indicating declining birth cohorts feeding into early education. Middle school enrollments remained relatively stable. It is important to note the COVID quarantine started in early 2020, and both Elementary and Middle school enrollment was lowest the following year.
Readers can hover over the plot to see specific data point information, zoom in or out, or double click a variable on the legend to see only the data for the variable.
data_combined <- read_csv("data/data_combined_attend.csv", show_col_types = FALSE) %>%
mutate(
Level = factor(Level, levels = c("Elementary School", "Middle School", "High School"))
)
enroll_longitudinal <- ggplot(data_combined, aes(x=Year, y=Enrollment, color=Level)) +
geom_point(size=2) +
geom_smooth(method=lm, size=1, alpha=0.25) +
theme_minimal() +
labs(title=" Enrollment Observations Longitudinally",
x="Year",
y="Entrollemnt",
color="Level")
enroll_longitudinal_plotly <- ggplotly(enroll_longitudinal)
enroll_longitudinal_plotly
We also visualized how attendance trends changed over time across the 3 different types of schools, which can be seen in the plot below. Each color represents the variable shown in the legend, each line line represents the linear trend, and shaded regions show 95% confidence intervals. We see gradual downward trends in all three school types, indicating that attendance has slightly declined from 2014 to 2021. Among the three school types, middle schools consistently maintain the highest attendance rates followed by elementary and then high schools. The decline was steepest in elementary schools, suggesting attendance among younger students may have been more sensitive to disruptions such as policy shifts or COVID-19 pandemic.
Readers can hover over the plots to see specific data point information, zoom in or out, or double click a variable on the legend to see only the data for the variable.
data_combined <- read_csv("data/data_combined_attend.csv", show_col_types = FALSE) %>%
mutate(
Level = factor(Level, levels = c("Elementary School", "Middle School", "High School"))
)
attend_longitudinal <- ggplot(data_combined, aes(x=Year, y=Attendance, color=Level)) +
geom_point(size=2) +
geom_smooth(method=lm, size=1, alpha=0.25) +
theme_minimal() +
labs(title=" Attendance Observations Longitudinally",
x="Year",
y="Attendance",
color="Level")
attend_longitudinal_plotly <- ggplotly(attend_longitudinal)
attend_longitudinal_plotly
The following statistics table summarizes the central tendency and variability of attendance and enrollment rates across US elementary, middle, and high schools. Click the arrow in the top right corner to see the rest of the values.
For attendance, mean rates range from 96.4% to 97.7%, indicating high participation levels. Elementary schools exhibit the widest range (3.39%) and the highest variability (SD = 1.32). This suggests that there is greater fluctuation in attendance among younger students. For enrollment, mean rates range from 94.96% to 99.2%, indicating generally high enrollment rates. Here, high schools display the largest spread (range = 7.38) and the highest variance (4.79). While attendance remained uniformly high across all levels, enrollment volatility increased with school level.
In the simple scatter plot below the table, the mean values of the enrollment and attendance observations are displayed with coloring by school type. This plot shows that high schools tends to have lower rates of both attendance and enrollment, while middle schools have the highest average attendance and enrollment rates. This is interesting becuase high school is typically when students have more freedom to choose their own rates of attendance and enrollment, while in middle and elementary school these are typically controlled by the student’s parents.
summary_stats <- data_long %>%
group_by(Type, Level) %>%
summarise(
n = n(),
Mean = mean(Value, na.rm=TRUE),
Median = median(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE),
Variance = var(Value, na.rm=TRUE),
Minimum = min(Value, na.rm=TRUE),
Maximum = max(Value, na.rm=TRUE),
Range = Maximum - Minimum,
Q1 = quantile(Value, 0.25, na.rm=TRUE),
Q3 = quantile(Value, 0.75, na.rm=TRUE),
IQR = Q3 - Q1,
) %>%
arrange(Type, Level) %>% view()
paged_table(summary_stats)
# write_csv(summary_stats, "data/tendency_attend.csv")
mean_plot <- ggplot(summary_stats, aes(x=Mean, y=Type, color=Level)) +
geom_point(size=4) +
theme_minimal() +
labs(title="Means of Attendance and Enrollment Observations",
x="Mean",
y="Type (Attendance/Enrollment)",
color="Level")
mean_plot
Next, we calculate R^2 and p-values for attendance and enrollment data combined. This did not return anything significant, and given that we only have three levels (elementary, middle, high school) per year we don’t have enough statistical power to run the model filtered in that way. If we filter by school level, we have 5 observations per level. It is important to note that while n=5 is technically valid, it is very low for significance testing and therefore should be taken with a grain of salt.
Overall, we did not find any significant results so we cannot conclude that enrollment predicts attendance in this dataset.
data_combined <- read_csv("data/data_combined_attend.csv", show_col_types = FALSE)
model <- lm(Enrollment ~ Attendance, data=data_combined)
model_glance <- glance(model) %>%
select(r.squared, p.value)
model_glance <- model_glance %>% mutate(Model = "Enrollment ~ Attendance")
model_glance <- model_glance %>% select(Model, everything())
pretty_r2_table(model_glance, "R² and p-value for Enrollment ~ Attendance Overall (Red = significant, Black = not significant)")
| Model | r.squared | p.value |
|---|---|---|
| Enrollment ~ Attendance | 0.1097253 | 0.2278019 |
by_schooling_r2 <- data_combined %>%
group_by(Level) %>%
do(glance(lm(Enrollment ~ Attendance, data=.))) %>%
select(Level,r.squared,p.value)
pretty_r2_table(by_schooling_r2, "R² and p-values by School Level (Red = significant, Black = not significant)")
| Level | r.squared | p.value |
|---|---|---|
| Elementary School | 0.1355655 | 0.5420218 |
| High School | 0.2899115 | 0.3491842 |
| Middle School | 0.0677148 | 0.6724550 |
We also looked into elementary school, middle school, and high school regression models. Across both elementary and middle education levels, attendance and enrollment rates show a weak linear relationship. The R2 values of 0.136 and 0.367 suggest that enrollment explains only a small part of the variance in attendance rates. Additionally, both models had p-values well over 0.05 which indicate statistical insignificance. Interestingly, the directions of association differ in elementary and middle school levels. While both variables increase together in elementary school, an increase in enrollment seems to indicate a decrease in attendance rates in middle school. These opposing slopes imply that factors influencing attendance may differ between younger and older students. High school results were slightly more significant compared to the other two levels, but not to the extent that they can be to considered statistically significant. The linear relationship at the level of schooling is also slightly stronger than the other two, but not by a significant factor. The direction of linearity in high school matches that of middle schools, or in other words an increase in one variable causes a decrease in another.
data_long_attend <- read_csv("data/data_long_attend.csv", show_col_types = FALSE) %>%
mutate(
Level = factor(Level, levels = c("Elementary School", "Middle School", "High School")),
Type = factor(Type, levels = c("Attendance", "Enrollment"))
)
data_wide <- data_long_attend %>%
select(Year, Level, Type, Value) %>%
pivot_wider(names_from = Type, values_from = Value) %>%
arrange(Level, Year) %>%
filter(!is.na(Attendance), !is.na(Enrollment))
#regression
glance_by_level <- data_wide %>%
tidyr::nest(data = -Level) %>%
mutate(model = purrr::map(data, ~ lm(Attendance ~ Enrollment, data = .x)),
stats = purrr::map(model, broom::glance)) %>%
tidyr::unnest(stats)
# First, create the text labels from our model summary table
panel_labels <- glance_by_level %>%
transmute(
Level,
label = paste0("R² = ", round(r.squared, 3), "\np = ", signif(p.value, 3))
)
# Determine the best position for the labels on each plot
label_pos <- data_wide %>%
group_by(Level) %>%
summarise(x = min(Enrollment, na.rm=T), y = max(Attendance, na.rm=T)) %>%
left_join(panel_labels, by = "Level")
p_reg <- ggplot(data_wide, aes(x = Enrollment, y = Attendance)) +
geom_point(aes(color = Level), alpha = 0.8, show.legend = FALSE) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
facet_wrap(~Level, scales = "free") +
geom_label(data = label_pos, aes(x = x, y = y, label = label),
hjust = 0, vjust = 1, size = 3.5, label.padding = unit(0.3, "lines")) +
labs(
title = "Attendance vs. Enrollment with OLS Fits by School Level (USA)",
subtitle = "Labels show R-squared and model p-value for each level-specific regression",
x = "Enrollment Rate (%)", y = "Attendance Rate (%)"
) +
theme_minimal(base_size = 14) +
theme(strip.text = element_text(face = "bold"))
print(p_reg)
Middle and high school enrollment rates are higher than they were in 2005, and while their attendance rates took the dip in 2020 they look be increasing back to where they were prior to COVID. Elementary school enrollment rates are lower now than they have been since before 2005, and have been falling since 2016. This could be due to a number of factors, such as policies that de-fund primary public schools or a shift towards private schooling and homeschooling.
Further analysis revealed that mean enrollment rates were consistently higher than mean attendance rates across school levels. This is backed up by our visualizations earlier in the analysis that revealed the same trend. This may point to structural barriers beyond access.
We did not find any statistical significance between attendance and enrollment rates, however with the amount of data we had post-filtering we were limited in how we could partition the data to search for smaller trends. We actually did find statistical significance in the decreasing attendance rates over time, as well as a relatively strong negative linear relationship, but we were unable to include that and fully explain it while staying within the word count for this assignment. We elected to exclude that table and analysis in favor of further exploratory analysis.
Finally, we ran a model for linear regression by school type. Elementary schools showed a weak positive linear relationship between enrollment and attendance, while middle and high schools showed a weak negative linear relationship. This was unexpected as a negative linear relationship between these variables implies that as one variable increases the other will decrease, and we expect that enrollment and attendance increase/decrease together. Given that we did not find any statistical significance between the two variables, it is entirely plausible that they could have a negative linear relationship.
Future work could extend this analysis in several directions. Examining attendance and enrollment at the state or district level would increase statistical power and may reveal heterogeneity masked in national averages. Incorporating policy-level data - such as attendance reporting requirements, funding changes, or shifts to hybrid instruction - could help distinguish behavioral change from measurement change during the pandemic period. Extending the dataset beyond 2022 would also clarify whether post-COVID attendance patterns represent a temporary disruption or a longer-term structural shift.
Reproducibility Notes
All intermediate datasets and figures are written to disk
(EdStats_USA.csv, EdStats_attend.csv,
data_attend.csv, data_long_attend.csv,
data_combined.csv, and exported PNGs).
This allows others to review and re-run later steps without repeating
earlier processing.
Consider adding sessionInfo() at the end to document
package versions.
World Bank Group. Education Statistics. World Bank Data Catalog, 2022, https://datacatalog.worldbank.org/search/dataset/0038480/Education-Statistics.